
feat: WOS structured author extraction with S2/MinerU merge fallback #14

Open
Caorui-Li wants to merge 1 commit into VisionXLab:v2 from Caorui-Li:carol-0511--author-mes

Conversation

@Caorui-Li

Summary

  • Add structured_author_fetcher.py: uses the WOS Starter API as the primary Phase 2 author source, with a fallback chain (WOS+PDF → WOS+S2 → S2+collector)
  • Add author_name_utils.py: name normalization and 5-rule fuzzy matching across WOS/S2/PDF name formats (handles initials, accents, and inverted name order)
  • Merge strategy: WOS+PDF is preferred; WOS+S2 is used when no PDF is available; the pure S2+collector path is the final fallback (behavior is identical to before when WOS is not configured)
  • Add a WOS API Key field to the UI config with save/load wiring; fix ConfigUpdate, which was missing wos_api_key, so the key was silently discarded on save
  • Fix citation count parsing for European locale formats (5.710, 5 710)
  • Add per-paper author debug logging and an _author_debug.json output file for testing
  • Change MinerU PDF parsing from first page only to the full document; update the prompt to prohibit fabricating authors and mixing in names from the reference list
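The name matching described in the summary can be illustrated with a minimal sketch. The five rules in author_name_utils.py are not listed in this description, so the functions below (`normalize_name`, `names_match`) and their rules are hypothetical stand-ins that cover only the three cases named above: accents, initials, and inverted order.

```python
import re
import unicodedata


def normalize_name(raw: str) -> tuple[str, list[str]]:
    """Return (family, given_tokens) in lowercase, accent-free form.

    Assumption: a comma means "Family, Given"; otherwise the last
    token is treated as the family name.
    """
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = plain.lower().replace(".", " ")
    if "," in plain:
        family, _, given = plain.partition(",")
    else:
        parts = plain.split()
        family = parts[-1] if parts else ""
        given = " ".join(parts[:-1])
    family = " ".join(re.findall(r"[a-z]+", family))
    return family, re.findall(r"[a-z]+", given)


def names_match(a: str, b: str) -> bool:
    """Illustrative matcher: same family name, compatible given names."""
    fam_a, giv_a = normalize_name(a)
    fam_b, giv_b = normalize_name(b)
    if not fam_a or fam_a != fam_b:
        return False
    if not giv_a or not giv_b:
        return False  # conservative: a family name alone is not a match
    for x, y in zip(giv_a, giv_b):
        if x == y:
            continue
        # A single-letter initial matches any given name starting with it.
        if (len(x) == 1 or len(y) == 1) and x[0] == y[0]:
            continue
        return False
    return True
```

With these rules, `names_match("García, José", "Jose Garcia")` and `names_match("Smith, J.", "John Smith")` both return `True`, while `names_match("Smith, J.", "Mary Smith")` returns `False`.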

AI coding brief

Original request: Integrate the WOS Starter API as the structured author extraction source for Phase 2. WOS gives accurate author lists but no affiliations, so affiliations come from the PDF (MinerU) when available and from S2 otherwise. When WOS is not configured, behavior should be identical to before.
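The requested behavior can be sketched as follows. This is not the code in structured_author_fetcher.py: the `Author` type, `merge_authors`, and the placeholder matcher are hypothetical, and the `"s2+collector"` label for the fallback path is an assumption (only `wos+pdf` and `wos+s2` appear in the test plan).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Author:
    name: str
    affiliations: list[str] = field(default_factory=list)


def _same_name(a: str, b: str) -> bool:
    # Placeholder matcher (assumption): case-insensitive exact match.
    # The PR uses fuzzy matching from author_name_utils.py instead.
    return a.strip().casefold() == b.strip().casefold()


def merge_authors(
    wos: list[Author],
    pdf: list[Author],
    s2: list[Author],
    same_name: Callable[[str, str], bool] = _same_name,
) -> tuple[str, list[Author]]:
    """Return (source_label, authors) following WOS+PDF > WOS+S2 > S2."""
    if not wos:
        # WOS not configured or returned nothing: keep the previous path.
        return "s2+collector", list(s2)
    # WOS supplies the author list; affiliations come from one other source.
    if pdf:
        secondary, label = pdf, "wos+pdf"
    else:
        secondary, label = s2, "wos+s2"
    merged = []
    for w in wos:
        hit = next((c for c in secondary if same_name(w.name, c.name)), None)
        merged.append(Author(w.name, list(hit.affiliations) if hit else []))
    return label, merged
```

Only one secondary source is consulted per paper, which reflects the corrected priority (not always querying both); authors with no name match keep an empty affiliation list rather than borrowing from a different person.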

Manual interventions:

  • Corrected merge priority: WOS+PDF > WOS+S2 > S2 (not always-query-both)
  • Fixed wos_api_key missing from ConfigUpdate in main.py — key was silently discarded on UI save
  • Fixed browser cache causing old HTML (without WOS field) to be served — required hard refresh
  • Clarified that actual Phase 2 pipeline is task_executor._run_new_phase2_and_3, not author_searcher.py
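The ConfigUpdate fix is easier to see with a small model of the failure. The sketch below is an assumption about the mechanism, not the code in main.py: it uses a plain dataclass and hypothetical helpers (`parse_update`, `apply_update`), and `s2_api_key` is an invented neighboring field. The point is that any key absent from the update model is dropped before it reaches the saved config.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class ConfigUpdate:
    s2_api_key: Optional[str] = None   # hypothetical existing field
    wos_api_key: Optional[str] = None  # without this line, the key is dropped


def parse_update(payload: dict) -> ConfigUpdate:
    """Keep only keys the model declares; unknown keys vanish silently."""
    known = {f.name for f in fields(ConfigUpdate)}
    return ConfigUpdate(**{k: v for k, v in payload.items() if k in known})


def apply_update(config: dict, update: ConfigUpdate) -> dict:
    """Merge non-empty fields of the update into a copy of the config."""
    merged = dict(config)
    merged.update({k: v for k, v in asdict(update).items() if v is not None})
    return merged


saved = apply_update({"s2_api_key": "old"}, parse_update({"wos_api_key": "abc"}))
```

Here `saved` contains both keys. Removing the `wos_api_key` field from the model makes the same call return the config unchanged with no error raised, which matches the silent discard described above.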

Retro: Specify upfront which code path is the production pipeline vs legacy/dead code. Also define multi-source merge priority rules before implementation to avoid rework.

Test plan

  • Fill WOS API Key in UI → save → verify config.json persists key
  • Run pipeline: Phase 2 log shows 📋 WOS 结构化作者提取已启用(WOS + S2 双源融合) (English: "WOS structured author extraction enabled (WOS + S2 dual-source merge)")
  • With PDF: source label wos+pdf, affiliations from PDF
  • Without PDF: source label wos+s2, affiliations from S2 where name-matched
  • Without WOS key: log shows ⚪ 未配置 WOS Key (English: "WOS Key not configured"), behavior identical to before
  • Check _author_debug.json in result folder for per-paper breakdown
  • European locale citation counts (e.g. 5.710) parsed correctly
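The locale check at the end of the test plan can be exercised with a small parser. This is a minimal sketch under stated assumptions, not the parser changed in this PR: `parse_citation_count` is a hypothetical name, and it treats a dot, comma, apostrophe, or space between three-digit groups as a thousands separator, since citation counts are whole numbers.

```python
import re
from typing import Optional

_PLAIN = re.compile(r"[0-9]+")
# 1 to 3 leading digits, then one or more separator + three-digit groups.
# \s also covers the non-breaking spaces some locales use.
_GROUPED = re.compile(r"[0-9]{1,3}(?:[.,'\s][0-9]{3})+")


def parse_citation_count(text: str) -> Optional[int]:
    """Parse an integer count such as '5710', '5.710', '5 710' or '5,710'.

    Returns None when the text is not a recognizable whole number.
    """
    s = text.strip()
    if _PLAIN.fullmatch(s):
        return int(s)
    if _GROUPED.fullmatch(s):
        # Drop every separator and keep only the digits.
        return int(re.sub(r"[^0-9]", "", s))
    return None
```

Under these rules `"5.710"`, `"5 710"`, and `"5,710"` all parse to `5710`, while `"5.7"` returns `None` instead of being misread as a count.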

🤖 Generated with Claude Code

Add Web of Science Starter API as the primary source for Phase 2 author
extraction, with intelligent fallback and cross-source affiliation merging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>